Lab 3: Extending Logistic Regression

By Lawrence Lim and Matthew Grover for CS 5324 Machine Learning in Python, due 3/12/2023
Data Selection and Business Understanding
Since the dataset used in Lab 1 is a binary classification task, it cannot be used for this lab. Instead, we searched for a new dataset with a multi-class prediction task.

One-Versus-All Multiclass Classification separates a multiclass problem into several binary classification problems, with one binary classifier for each class in the multiclass problem.
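The decomposition described above can be sketched in a few lines. The helper names (`one_versus_all_fit`, `toy_fit`) and the toy nearest-center "classifier" are illustrative assumptions, not part of the lab code; they only show how per-class binary problems are combined by an argmax over scores:

```python
import numpy as np

def one_versus_all_fit(X, y, fit_binary):
    """Train one binary scorer per class; predict via the most confident one."""
    classes = np.unique(y)
    scorers = {c: fit_binary(X, (y == c).astype(int)) for c in classes}
    def predict(Xnew):
        scores = np.column_stack([scorers[c](Xnew) for c in classes])
        return classes[np.argmax(scores, axis=1)]  # most confident binary classifier wins
    return predict

# toy "classifier": score = negative distance to the positive-class centroid
def toy_fit(X, y_bin):
    center = X[y_bin == 1].mean(axis=0)
    return lambda Xnew: -np.linalg.norm(Xnew - center, axis=1)  # closer => higher score

X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
predict = one_versus_all_fit(X, y, toy_fit)
print(predict(np.array([[0.05], [1.05], [2.05]])))  # -> [0 1 2]
```

In the lab's actual classifier further below, `fit_binary` corresponds to training a `VectorBinaryLogisticRegression` on each `y == yval` mask.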
Cancer is a severe disease that affects millions of people worldwide. According to the American Cancer Society, approximately 1.8 million new cancer cases were diagnosed in the United States in 2020, and cancer remained the second-leading cause of death behind heart disease in 2022. The earlier cancer is diagnosed, the better the chances of successful treatment and recovery.
The dataset we are using is from Kaggle: https://www.kaggle.com/datasets/allanwandia/cancer-diagnosis. It consists of information about a patient's cancer diagnosis, such as the type of tumor, the age at which the patient was diagnosed, their gender/ethnicity, and the state of various genes or cells in the patient (i.e., whether those particular cells mutated or not). The latter information is especially significant, as it can provide insights into cell patterns in relation to the type of tumor in all or a subset of patients.
Our analysis aims to develop a predictive model that can accurately determine the type of tumor formed based on the various attributes provided in the dataset. A successful and accurate model could help medical professionals diagnose cancer earlier, maximizing the chance that a cancerous tumor can be effectively removed (when it is smaller and less developed) and leading to better patient outcomes.
The dataset could also provide insights into the other factors that contribute to cancer, as specified by recorded patient attributes like age and gender. Researchers could use this information to understand the disease better and develop new treatments.
The stakeholders who may be interested in this analysis include medical professionals, cancer patients, their families, and researchers in the field of oncology. Medical professionals could use the predictive model to diagnose brain cancer more accurately and efficiently, leading to better patient outcomes. Cancer patients and their families could benefit from early diagnosis and treatment, improving the patient's chances of survival. Researchers could use the dataset to identify new risk factors and treatment options for brain cancer.
The success of the predictive model will be measured by its ability to predict the type of tumor effectively. A successful model should have a high accuracy rate, preferably above 95%. The model should also be able to identify false positives and false negatives, as misdiagnosis can have severe consequences for the patient.
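These criteria can be made concrete with scikit-learn's metrics. The labels below are hypothetical, chosen only to show how accuracy and per-class false positives/negatives are read off a confusion matrix:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical true vs. predicted tumor classes (0, 1, 2), for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions -> 0.75
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
```

Off-diagonal entries in row i count the true class-i cases misclassified as another class (false negatives for class i), while off-diagonal entries in column j count cases wrongly predicted as class j (false positives for class j).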
Finally, concerning deployment for online versus offline analysis, utilizing an online system or database that aggregates patient cancer records from hospitals nationwide would be best. This is important because location factors can also influence cancer prevalence; for instance, whether or not a person lives in a city with heavy pollution. Thus, if we were to have an offline model that only takes cancer diagnoses from one
hospital in one city with more pollution, the prevalence of certain tumors might be biased in a manner that is not explained by the given genes. In summary, increasing the scope of our study through a database accepting online inputs from doctors across the nation ensures that our study can better "target" the relationship between genetics alone and cancer tumors, with individual city biases not being overwhelmingly significant.
Dataset Preprocessing and Feature Summary

1. Dimensionality Reduction and Data Cleaning
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
# https://www.kaggle.com/datasets/allanwandia/cancer-diagnosis
df = pd.read_csv("/work/Cancer Diagnosis - TCGA_GBM_LGG_Mutations_all.csv")
each_individual_unique_bool = df["Case_ID"].is_unique
print("")
print("the premise that each individual ID in the DF is unique is", each_individual_unique_bool)
print("")
# display feature list, along with their data types
df.info()
df.head()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Grade 857 non-null object
1 Project 857 non-null object
2 Case_ID 857 non-null object
3 Gender 857 non-null object
4 Age_at_diagnosis 857 non-null object
5 Primary_Diagnosis 857 non-null object
6 Race 857 non-null object
7 IDH1 857 non-null object
8 TP53 857 non-null object
9 ATRX 857 non-null object
10 PTEN 857 non-null object
11 EGFR 857 non-null object
12 CIC 857 non-null object
13 MUC16 857 non-null object
14 PIK3CA 857 non-null object
15 NF1 857 non-null object
16 PIK3R1 857 non-null object
17 FUBP1 857 non-null object
18 RB1 857 non-null object
19 NOTCH1 857 non-null object
20 BCOR 857 non-null object
21 CSMD3 857 non-null object
Below, we can see the relative distribution of the different types of cancer tumor diagnoses across all patients. It is important to note that although the types of tumors vary in prevalence, no single type of tumor is incredibly rare (i.e., occurring only a few times). This matters because having a few tumor types that are all incredibly rare can make it more difficult for the training data to adequately prepare the model to predict those specific types of cancer.
tumor_counts = df["Primary_Diagnosis"].value_counts()
# Create a bar chart of the tumor counts
plt.bar(tumor_counts.index, tumor_counts.values)
plt.xticks(rotation=90)
plt.xlabel("Tumor Type")
plt.ylabel("Number of Cases")
plt.title("Number of Cancer Cases by Tumor Type")
plt.show()
[Output: df.info() additionally reports memory usage of 180.9+ KB, with every column a non-null object. df.head() shows the first five rows: Grade (LGG/GBM), Project (TCGA-LGG/TCGA-GBM), Case_ID, Gender, Age_at_diagnosis (e.g. "47 years 165 days"), Primary_Diagnosis (e.g. "Oligodendroglioma, NOS", "Glioblastoma"), Race, and the gene columns.]
For the chart below, we get the prevalence of each genetic feature and can also conclude that no single feature has too small a prevalence to be statistically insignificant. This ensures that there should be enough instances of each particular feature to allow the model to associate the feature with a specific type of cancer (if an association is found).
# Get the value counts for each gene feature
gene_counts = df.iloc[:, 7:].apply(pd.value_counts).fillna(0).astype(int)
# Transpose the dataframe and sort the values in descending order
gene_counts = gene_counts.T.sort_values(by=1,ascending=False)
# Create a bar chart of the gene mutation counts
plt.bar(gene_counts.index, gene_counts[1])
plt.xticks(rotation=90)
plt.xlabel('Gene Feature')
plt.ylabel('Number of Mutations')
plt.title('Mutation Prevalence of Gene Features')
plt.show()
Since our prediction task is to identify the type of cancer determined by the primary diagnosis based on genetic markers, we will remove all features of the data except for the primary diagnosis and the genetic markers. Furthermore, it is important to note that all of our data is non-null, as specified above. Thus, we do not have to worry about imputation or deletion of particular rows, nor about errors in calculations caused by NaN values.
del df["Grade"]
del df["Project"]
del df["Case_ID"]
del df["Gender"]
del df["Age_at_diagnosis"]
del df["Race"]
The next step is to replace all of the categorical data with numeric encoding. This is important because logistic regression operates on numeric inputs, so string labels such as "MUTATED"/"NOT_MUTATED" must be mapped to numbers before fitting.

NOTE: To simplify the multiclass classification task, we will consolidate NOS and Anaplastic tumors of the same type.
df['Primary_Diagnosis'].replace(['Astrocytoma, NOS', 'Astrocytoma, anaplastic', 'Glioblastoma', 'Mixed gliom
[0, 0, 1, 1, 2, 2], inplace=True)
gene_cols = ['IDH1', 'TP53', 'ATRX', 'PTEN', 'EGFR', 'CIC', 'MUC16', 'PIK3CA', 'NF1', 'PIK3R1',
             'FUBP1', 'RB1', 'NOTCH1', 'BCOR', 'CSMD3', 'SMARCA4', 'GRIN2A', 'IDH2', 'FAT4', 'PDGFRA']
for col in gene_cols:
    df[col].replace(['NOT_MUTATED', 'MUTATED'], [0, 1], inplace=True)
df.head()
2. Feature Summary
Primary_Diagnosis: the prediction goal we have determined for this lab. A 0 represents Astrocytoma, a 1 represents Glioma, and a 2 represents Oligodendroglioma. The remaining features in the dataframe refer to genetic mutations in the cancer patient's DNA. A 0 means that the mutation does not exist, while a 1 means that the mutation exists.
df.describe()
[Output: df.head() now shows the encoded rows (Primary_Diagnosis in 0-2 with binary gene columns), the gene columns now have dtype int64, and df.describe() reports (first columns shown):]

       Primary_Diagnosis   IDH1    TP53    ATRX    PTEN    EGFR
count  857.0               857.0   857.0   857.0   857.0   857.0
mean   0.994               0.481   0.412   0.256   0.167   0.132
std    0.657               0.500   0.492   0.436   0.373   0.339
min    0.0                 0.0     0.0     0.0     0.0     0.0
25%    1.0                 0.0     0.0     0.0     0.0     0.0
50%    1.0                 0.0     0.0     0.0     0.0     0.0
75%    1.0                 1.0     1.0     1.0     0.0     0.0
max    2.0                 1.0     1.0     1.0     1.0     1.0
3. Training/Testing Split and the 80/20 Ratio
An 80/20 split should be appropriate for our dataset, as it allows a sufficient amount of information to adequately train the model while leaving enough data to run proper tests. In our case, 170 rows are used for testing and 685 rows are used in training. If we were to increase our training data to 90% and only leave 10% to test with, that would leave us only around 85 rows to test. This is significant because one row would then account for approximately 1.2% of the testing set. Thus, if there are just three or so rows in the latter part of the dataset (the testing set) that aren't accounted for in training, they could skew the accuracy values more significantly. However, with 170 rows, a few rows that the training hasn't prepared for cannot significantly impact the accuracy.
It is also important to note that we are trying to balance between overfitting and underfitting the model. Suppose we have a proportion of training data that is too large (like 90%). In that case, it can lead to inaccuracies in testing, as the model is overfitted to the point where it cannot adapt to any new variations in the testing data. On the other hand, not training the model enough (e.g., a 50/50 split) can lead to a model that isn't prepared for the data, as the training hasn't fully accounted for the reasoning behind the variation.
Finally, it is important to note that we are, of course, assuming that the attributes of our data are related to the presence of the particular brain tumors accounted for in this dataset. If these genetic factors and other attributes like age and ethnicity are entirely unrelated to what we try to classify, how we split the data will not matter.
# convert from pandas dataframes to numpy arrays
y = df["Primary_Diagnosis"].to_numpy()
X = df.copy()
del X["Primary_Diagnosis"]
X = X.to_numpy()
from sklearn.model_selection import ShuffleSplit
# from: https://github.com/eclarson/MachineLearningNotebooks/blob/master/06.%20Optimization.ipynb
# to use the cross validation object in scikit learn, we need to grab an instance
# of the object and set it up. This object will be able to split our data into
# training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations, test_size=0.2)
print(cv_object)
Though an 80/20 split should work, it is good to confirm that this is the case. Below, we perform a simple mean squared error test to determine which split choice reduces the error. Rerunning the code segment below, we consistently find that a testing set of 20-30% of the data usually yields the lowest MSE, i.e., the most accurate results. Thus, a 70/30 or 80/20 split is ideal for our data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# define a range of train-test splits to evaluate
# (output of print(cv_object) above)
ShuffleSplit(n_splits=3, random_state=None, test_size=0.2, train_size=None)
splits = [0.2, 0.3, 0.4, 0.5, 0.6,0.7,0.8]
# define empty list to store MSEs for each split
mse_values = []
best_split = None
lowest_mse = float('inf')
# loop over each split
for split in splits:
    # split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)
    # train a linear regression model on the training set
    model = LinearRegression()
    model.fit(X_train, y_train)
    # make predictions on the test set
    y_pred = model.predict(X_test)
    # calculate the mean squared error for the predictions
    mse = mean_squared_error(y_test, y_pred)
    # add the mse to the list of mse values
    mse_values.append(mse)
    if mse < lowest_mse:
        lowest_mse = mse
        best_split = split
print("the split values are", mse_values)
print("")
print("The best split is:", best_split)
print("the lowest mse is", lowest_mse)
Custom Logistic Regression Classifier

1. Starting Classifier Code
# from: https://github.com/eclarson/MachineLearningNotebooks/blob/master/05.%20Logistic%20Regression.ipynb
from scipy.special import expit
from sklearn.metrics import accuracy_score

class BinaryLogisticRegressionBase:
    # private:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    def __str__(self):
        return 'Base Binary Logistic Regression Object, Not Trainable'
# (output of the train/test split evaluation cell above)
the split values are [0.29945842587501276, 0.3325816777646879, 0.3246970461826918, 0.33752088504303696, 0.35122199552083644, 0
The best split is: 0.2
the lowest mse is 0.29945842587501276
    # convenience, private and static:
    @staticmethod
    def _sigmoid(theta):
        return 1/(1+np.exp(-theta))
    @staticmethod
    def _add_intercept(X):
        return np.hstack((np.ones((X.shape[0], 1)), X))  # add bias term
    # public:
    def predict_proba(self, X, add_intercept=True):
        Xb = self._add_intercept(X) if add_intercept else X  # add bias term if requested
        return self._sigmoid(Xb @ self.w_)  # return the probability y=1
    def predict(self, X):
        return (self.predict_proba(X) > 0.5)  # return the actual prediction
class BinaryLogisticRegression(BinaryLogisticRegressionBase):
    # private:
    def __str__(self):
        if hasattr(self, 'w_'):
            return 'Binary Logistic Regression Object with coefficients:\n' + str(self.w_)  # if we have trained
        else:
            return 'Untrained Binary Logistic Regression Object'
    def _get_gradient(self, X, y):
        # programming \sum_i (yi-g(xi))xi
        gradient = np.zeros(self.w_.shape)  # set gradient to zero
        for (xi, yi) in zip(X, y):
            # the actual update inside of sum
            gradi = (yi - self.predict_proba(xi, add_intercept=False))*xi
            # reshape to be column vector and add to gradient
            gradient += gradi.reshape(self.w_.shape)
        return gradient/float(len(y))
    # public:
    def fit(self, X, y):
        Xb = self._add_intercept(X)  # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = np.zeros((num_features, 1))  # init weight vector to zeros
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb, y)
            self.w_ += gradient*self.eta  # multiply by learning rate
class VectorBinaryLogisticRegression(BinaryLogisticRegression):
    # inherit from our previous class to get same functionality
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta)  # 1/(1+np.exp(-theta))
    # but overwrite the gradient calculation
    def _get_gradient(self, X, y):
        ydiff = y - self.predict_proba(X, add_intercept=False).ravel()  # get y difference
        gradient = np.mean(X * ydiff[:, np.newaxis], axis=0)  # make ydiff a column vector and multiply through
        return gradient.reshape(self.w_.shape)
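As a standalone sanity check (not part of the lab notebook), the vectorized gradient above can be verified against the per-sample loop from the base class on random data; the shapes and random draws here are illustrative:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)
w = rng.normal(size=(3, 1))

# per-sample loop, as in BinaryLogisticRegression._get_gradient
loop_grad = np.zeros_like(w)
for xi, yi in zip(X, y):
    loop_grad += ((yi - expit(xi @ w)) * xi).reshape(w.shape)
loop_grad /= float(len(y))

# vectorized form, as in VectorBinaryLogisticRegression._get_gradient
ydiff = y - expit(X @ w).ravel()
vec_grad = np.mean(X * ydiff[:, np.newaxis], axis=0).reshape(w.shape)

print(np.allclose(loop_grad, vec_grad))  # -> True
```

The two agree to floating-point precision; the vectorized form simply replaces the Python loop with a single broadcasted mean.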
class LogisticRegression:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    def __str__(self):
        if hasattr(self, 'w_'):
            return 'MultiClass Logistic Regression Object with coefficients:\n' + str(self.w_)  # if we have trained
        else:
            return 'Untrained MultiClass Logistic Regression Object'
    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.unique_ = np.unique(y)  # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []  # will fill this array with binary classifiers
        for i, yval in enumerate(self.unique_):  # for each unique value
            y_binary = (y == yval)  # create a binary problem
            # train the binary classifier for this class
            blr = VectorBinaryLogisticRegression(self.eta, self.iters)
            blr.fit(X, y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
    def predict_proba(self, X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X))  # get probability for each classifier
        return np.hstack(probs)  # make into single matrix
    def predict(self, X):
        return self.unique_[np.argmax(self.predict_proba(X), axis=1)]  # take argmax along row

lr = LogisticRegression(0.1, 1500)
lr.fit(X, y)
print(lr)
yhat = lr.predict(X)
print('Accuracy of: ', accuracy_score(y, yhat))
MultiClass Logistic Regression Object with coefficients:
[[-2.00789992 0.86670863 0.51244855 0.54206735 -0.35255891 0.24285564
-1.47044138 -0.30459214 -0.01454668 0.6699159 -0.28952282 -0.52830559
-0.24733512 -0.15071247 -0.01318224 0.11942813 0.25293933 -0.06662209
-0.36064693 0.4892937 -0.12044242]
[ 1.2280023 -1.95370621 0.24182016 -0.0258706 0.63794481 -0.07189571
-0.56936955 0.38041765 -0.22627682 -0.27456283 0.33317956 -0.19230557
0.30445355 -0.6823748 0.23499388 -0.03222884 -0.53934279 0.39451661
-1.01787548 -0.33010579 0.07068744]
2. Adding Multiple Optimization Techniques and Adjustable "C"
%%time
class BinaryLogisticRegression:
    def __init__(self, eta, iterations=20, C=0.001):
        self.eta = eta
        self.iters = iterations
        self.C = C
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    def __str__(self):
        if hasattr(self, 'w_'):
            return 'Binary Logistic Regression Object with coefficients:\n' + str(self.w_)  # if we have trained
        else:
            return 'Untrained Binary Logistic Regression Object'
    # convenience, private:
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0], 1)), X))  # add bias term
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta)  # 1/(1+np.exp(-theta))
    # vectorized gradient calculation with regularization using L2 Norm
    def _get_gradient(self, X, y):
        ydiff = y - self.predict_proba(X, add_bias=False).ravel()  # get y difference
        gradient = np.mean(X * ydiff[:, np.newaxis], axis=0)  # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        return gradient
    # public:
    def predict_proba(self, X, add_bias=True):
        Xb = self._add_bias(X) if add_bias else X  # add bias term if requested
        return self._sigmoid(Xb @ self.w_)  # return the probability y=1
    def predict(self, X):
        return (self.predict_proba(X) > 0.5)  # return the actual prediction
    def fit(self, X, y):
        Xb = self._add_bias(X)  # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = np.zeros((num_features, 1))  # init weight vector to zeros
        # for as many as the max iterations
# (continuation of the multiclass coefficients output above)
[-2.01957993 1.67464097 -0.97585309 -0.43474744 -1.15894791 -0.56953157
1.41680131 -0.27148083 0.16660748 -0.73763847 -0.22549029 0.56782404
-0.33468881 0.80142205 -0.28155584 -0.07098278 0.38421845 -0.40564491
1.27438801 -0.12219563 -0.01825568]]
Accuracy of: 0.6779463243873979
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb, y)
            self.w_ += gradient*self.eta  # multiply by learning rate; add because maximizing
Below we define classes for each optimization technique.
from scipy.optimize import minimize_scalar
from numpy import ma  # moved up: ma is used in the objective below

class LineSearchLogisticRegression(BinaryLogisticRegression):
    # define custom line search for problem
    def __init__(self, line_iters=0.0, **kwds):
        self.line_iters = line_iters
        super().__init__(**kwds)  # call parent initializer, keep other keywords
    # this defines the function with the first input to be optimized
    # therefore eta will be optimized, with all inputs constant
    @staticmethod
    def objective_function(eta, X, y, w, grad, C):
        wnew = w - grad*eta
        g = expit(X @ wnew)
        # the line search is looking for minimization, so take the negative of l(w)
        return -np.sum(ma.log(g[y==1])) - np.sum(ma.log(1-g[y==0])) + C*np.sum(wnew**2)
    def fit(self, X, y):
        Xb = self._add_bias(X)  # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = np.zeros((num_features, 1))  # init weight vector to zeros
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = -self._get_gradient(Xb, y)  # minimization is in opposite direction
            # do line search in gradient direction, using scipy function
            opts = {'maxiter': self.line_iters}  # unclear exactly what this should be
            res = minimize_scalar(self.objective_function,         # objective function to optimize
                                  bounds=(0, self.eta*10),          # bounds to optimize over
                                  args=(Xb, y, self.w_, gradient, self.C),  # additional args for objective
                                  method='bounded',                 # bounded optimization for speed
                                  options=opts)                     # set max iterations
            eta = res.x  # get optimal learning rate
            self.w_ -= gradient*eta  # subtract to minimize
from scipy.optimize import fmin_bfgs  # maybe the most common bfgs algorithm in the world
from numpy import ma

class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
    @staticmethod
    def objective_function(w, X, y, C):
        g = expit(X @ w)
        # invert this because scipy minimizes, but we derived all formulas for maximizing
        return -np.sum(ma.log(g[y==1])) - np.sum(ma.log(1-g[y==0])) + C*np.sum(w**2)
        # -np.sum(y*np.log(g)+(1-y)*np.log(1-g))
    @staticmethod
    def objective_gradient(w, X, y, C):
        g = expit(X @ w)
        ydiff = y - g  # get y difference
        gradient = np.mean(X * ydiff[:, np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)
        gradient[1:] += -2 * w[1:] * C
        return -gradient
    # just overwrite fit function
    def fit(self, X, y):
        Xb = self._add_bias(X)  # add bias term
        num_samples, num_features = Xb.shape
        self.w_ = fmin_bfgs(self.objective_function,        # what to optimize
                            np.zeros((num_features, 1)),    # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb, y, self.C),           # extra args for gradient and objective function
                            gtol=1e-03,                     # stopping criteria for gradient, |v_k|
                            maxiter=self.iters,             # stopping criteria iterations
                            disp=False)
        self.w_ = self.w_.reshape((num_features, 1))
class StochasticLogisticRegression(BinaryLogisticRegression):
    # stochastic gradient calculation
    def _get_gradient(self, X, y):
        idx = int(np.random.rand()*len(y))  # grab random instance
        ydiff = y[idx] - self.predict_proba(X[idx], add_bias=False)  # get y difference (now scalar)
        gradient = X[idx] * ydiff[:, np.newaxis]  # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        return gradient
from numpy.linalg import pinv

class HessianBinaryLogisticRegression(BinaryLogisticRegression):
    # just overwrite gradient function
    def _get_gradient(self, X, y):
        g = self.predict_proba(X, add_bias=False).ravel()  # get sigmoid value for all classes
        hessian = X.T @ np.diag(g*(1-g)) @ X - 2 * self.C  # calculate the hessian
        ydiff = y - g  # get y difference
        gradient = np.sum(X * ydiff[:, np.newaxis], axis=0)  # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        return pinv(hessian) @ gradient
Here we define a multiclass regression object, where the optimization technique can be selected through the solver variable. Below the class definition, we test each configuration:

LineSearchLogisticRegression: Steepest Descent / Line Search
BFGSBinaryLogisticRegression: Rank 2 Hessian / Quasi-Newton Method
StochasticLogisticRegression: Stochastic Gradient Descent
HessianBinaryLogisticRegression: Rank 1 Hessian / Newton's Update Method
class MultiClassLogisticRegression:
    def __init__(self, eta, iterations=20, C=0.0001, solver=BFGSBinaryLogisticRegression):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.solver = solver
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    def __str__(self):
        if hasattr(self, 'w_'):
            return 'MultiClass Logistic Regression Object with coefficients:\n' + str(self.w_)  # if we have trained
        else:
            return 'Untrained MultiClass Logistic Regression Object'
    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y))  # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []
        for i, yval in enumerate(self.unique_):  # for each unique value
            y_binary = np.array(y == yval).astype(int)  # create a binary problem
            # train the binary classifier for this class
            hblr = self.solver(eta=self.eta, iterations=self.iters, C=self.C)
            hblr.fit(X, y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(hblr)
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
    def predict_proba(self, X):
        probs = []
        for hblr in self.classifiers_:
            probs.append(hblr.predict_proba(X).reshape((len(X), 1)))  # get probability for each classifier
        return np.hstack(probs)  # make into single matrix
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)  # take argmax along row
%%time
lr = MultiClassLogisticRegression(eta=1,
                                  iterations=10,
                                  C=0.01,
                                  solver=LineSearchLogisticRegression)
lr.fit(X, y)
#print(lr)
yhat = lr.predict(X)
print('Accuracy of: ', accuracy_score(y, yhat))
%%time
lr = MultiClassLogisticRegression(eta=1,
                                  iterations=10,
                                  C=0.01,
                                  solver=BFGSBinaryLogisticRegression)
lr.fit(X, y)
#print(lr)
yhat = lr.predict(X)
print('Accuracy of: ', accuracy_score(y, yhat))
%%time
lr = MultiClassLogisticRegression(eta=0.005,
                                  iterations=500,
                                  C=0.001,
                                  solver=StochasticLogisticRegression)
lr.fit(X, y)
#print(lr)
yhat = lr.predict(X)
print('Accuracy of: ', accuracy_score(y, yhat))
%%time
lr = MultiClassLogisticRegression(eta=1,
                                  iterations=10,
                                  C=0.001,
                                  solver=HessianBinaryLogisticRegression)
lr.fit(X, y)
#print(lr)
# LineSearchLogisticRegression:
Accuracy of: 0.6662777129521587
CPU times: user 73.6 ms, sys: 68.9 ms, total: 142 ms
Wall time: 95.9 ms
# BFGSBinaryLogisticRegression:
Accuracy of: 0.6639439906651109
CPU times: user 221 ms, sys: 215 ms, total: 436 ms
Wall time: 450 ms
# StochasticLogisticRegression:
Accuracy of: 0.5694282380396732
CPU times: user 81.4 ms, sys: 44.8 ms, total: 126 ms
Wall time: 135 ms
yhat = lr.predict(X)
print('Accuracy of: ', accuracy_score(y, yhat))
3. Creating the Optimal Classifier
To create the optimal classifier, we must determine which optimization technique to use, as well as the regularization term. From the above tests, Newton's Update method gave us the best accuracy. All that is left is to identify the ideal regularization term "C". Modifying this regularization term is significant, as it determines the balance between generalization and overfitting in the model. In this case, a smaller regularization-controlling C value can keep our model from being too overfitted, which means the model can better interpret and adapt to new testing data that might differ from the training data. For instance, it is possible that our training data may not correctly account for particular genes that happen to be mutated mostly in the testing set. In this case, the model must not overfit to the few such points in the training data.
When comparing the results of the boxplot below with the accuracies received above, we see that pairing the optimal C value with the best-performing solver is justified, as we can tune for maximum accuracy and consistency, which is shown by the box plot's mean and quartile ranges. However, it is important to note that for our data, the best C value's accuracy was not statistically better than the accuracy result of the multiclass regression in our starter code, which was 66%.
from sklearn import metrics as mt
# alternatively, we can also graph out the values using boxplots
num_cv_iterations = 20
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations, test_size=0.5)

def lr_explor(cost):
    lr_clf = MultiClassLogisticRegression(eta=1,
                                          iterations=10,
                                          C=float(cost),
                                          solver=HessianBinaryLogisticRegression)
    #lr_clf = BFGSBinaryLogisticRegression(eta=0.1, iterations=10, C=float(cost))  # get object
    acc = []
    for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X, y)):
        lr_clf.fit(X[train_indices], y[train_indices])  # train object
        y_hat = lr_clf.predict(X[test_indices])  # get test set predictions
        acc.append(mt.accuracy_score(y[test_indices], y_hat))
    acc = np.array(acc)
    return acc
costs = np.logspace(-5,1,20)
accs = []
for c in costs:
    accs.append(lr_explor(c))
# now show a boxplot of the data across c
from matplotlib import pyplot as plt
%matplotlib inline
plt.boxplot(accs)
plt.xticks(range(1, len(costs)+1), ['%.4f' % (c) for c in costs], rotation='vertical')
# Output of the HessianBinaryLogisticRegression (Newton's Update) cell above:
Accuracy of: 0.7036172695449242
CPU times: user 274 ms, sys: 393 ms, total: 667 ms
Wall time: 685 ms
plt.xlabel('C')
plt.ylabel('validation accuracy')
plt.show()
4. Assessment of Our Classifier, Comparison with scikit-learn, and Deployment Recommendations
Below we compare the results of the best logistic regression optimization procedure to the procedure in scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
# define empty list to store accuracy scores
accuracy_scores = []
# loop over 20 iterations
for i in range(20):
    # split the data into training and testing sets with different random seeds
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    # train a logistic regression model on the training set
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train, y_train)
    # make predictions on the test set
    y_pred = model.predict(X_test)
    # calculate the accuracy score for the predictions
    accuracy = accuracy_score(y_test, y_pred)
    # add the accuracy score to the list of scores
    accuracy_scores.append(accuracy)
# plot boxplots of the accuracy scores for each iteration
plt.boxplot(accuracy_scores)
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Boxplot of Accuracy Scores for Logistic Regression Model using Sci Kit Learn')
plt.show()
The scikit-learn library enables parallelization of the calculations across multiple processors. By separating each classification and performing operations in lower-level languages, the process can be performed more quickly than an unoptimized logistic regression implemented manually.
Another factor contributing to scikit-learn's speed and accuracy is the advanced update technique being used. The optimizer in scikit-learn is the "lbfgs" algorithm, a quasi-Newton method that approximates the second derivative (curvature) of the objective function using a limited-memory estimate of the inverse Hessian matrix.
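To make that concrete, here is a minimal sketch of fitting a binary logistic loss with SciPy's L-BFGS implementation; the synthetic data and variable names (`X_toy`, `y_toy`) are illustrative only, not the lab's cancer dataset:

```python
import numpy as np
from scipy.optimize import minimize

# small synthetic binary-classification problem
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = (X_toy @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    # negative log-likelihood (cross-entropy) of logistic regression
    p = sigmoid(X @ w)
    eps = 1e-12  # clip to avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def grad(w, X, y):
    # analytic gradient of the negative log-likelihood
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# L-BFGS builds its curvature estimate from recent gradient differences
res = minimize(nll, np.zeros(3), args=(X_toy, y_toy), jac=grad, method='L-BFGS-B')
print(res.success, np.round(res.x, 2))
```

scikit-learn's `solver='lbfgs'` wraps essentially this kind of call, with the regularized objective substituted for the plain negative log-likelihood.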
Regularization is another technique used to improve the accuracy of the logistic regression model. It helps prevent overfitting by introducing a penalty term in the objective function that discourages large parameter values. This penalty term can be controlled using a hyperparameter that is chosen based on the level of regularization needed.
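In scikit-learn's `LogisticRegression` this hyperparameter is `C`, the inverse of the regularization strength (smaller `C` means a stronger penalty). A small sketch on synthetic data, purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data, not the lab's dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

results = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    results[C] = cross_val_score(clf, X_syn, y_syn, cv=5).mean()
    print('C=%g: mean CV accuracy %.3f' % (C, results[C]))
```

This mirrors the `np.logspace` sweep over costs performed with our own classifier above.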
Given our accuracy score of 69%, and looking at the optimal classifier chart in section 3 above, we find that scikit-learn does indeed perform with an accuracy score similar to the accuracies of the best possible "C" selections, which are the smaller values of C.
Finally, it is important to note the size and type of our data as factors in scikit-learn's performance. Considering that our data has a little less than 1,000 rows, it meets the size criteria of lbfgs, which is best suited for small or medium-sized datasets, versus alternatives that are better suited for larger datasets.
However, it is also important to note that the default lbfgs solver is documented as performing worse on poorly scaled datasets, or on datasets with one-hot encoded values for features that do not occur frequently.
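A common remedy for the scaling issue is to standardize features before fitting. A minimal sketch, assuming synthetic data with one deliberately mis-scaled feature (again, not our actual dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_s, y_s = make_classification(n_samples=400, n_features=8, random_state=0)
X_s[:, 0] *= 1e4  # put one feature on a wildly different scale on purpose

# identical model, with and without standardization in front of it
plain = LogisticRegression(solver='lbfgs', max_iter=200)
scaled = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs', max_iter=200))

acc_plain = cross_val_score(plain, X_s, y_s, cv=5).mean()
acc_scaled = cross_val_score(scaled, X_s, y_s, cv=5).mean()
print('unscaled: %.3f, scaled: %.3f' % (acc_plain, acc_scaled))
```

On badly scaled inputs the unscaled fit typically converges more slowly (or triggers a convergence warning), which is the behavior the scikit-learn documentation cautions about.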
The point about infrequent one-hot features is a significant one, as some of the less common genetic factors (i.e., 20 or fewer people having a particular gene mutated) might not occur frequently enough for the model to be appropriately trained.
In essence, this suggests that our model could have performed even better if we could reduce the number of features or combine feature classifications into a smaller number of larger groups. For instance, we could have two groups, where one defines the presence of brain tumors classified as more lethal, and the other defines the presence of less malignant brain tumors.
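One way to implement this kind of grouping is to collapse rare categories into a single "OTHER" bucket before one-hot encoding. A sketch with pandas, where the gene names and the frequency threshold are hypothetical:

```python
import pandas as pd

# hypothetical gene-mutation column; names and threshold are illustrative only
genes = pd.Series(['TP53', 'TP53', 'BRCA1', 'RARE_A', 'RARE_B', 'TP53', 'BRCA1'])

counts = genes.value_counts()
rare = counts[counts < 2].index               # categories seen fewer than 2 times
grouped = genes.where(~genes.isin(rare), 'OTHER')  # collapse rare values
dummies = pd.get_dummies(grouped)             # one-hot encode the collapsed column
print(sorted(dummies.columns))                # → ['BRCA1', 'OTHER', 'TP53']
```

This gives each remaining indicator column enough positive examples for the solver to estimate a stable coefficient.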
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (info on lbfgs)
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Additional Analysis
1. Optimization using Mean Square Error
class NewtonLogisticRegression:
    def __init__(self, max_iter=100, tol=1e-4, C=1.0):
        self.max_iter = max_iter
        self.tol = tol
        self.C = C

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def mse_loss(self, y_pred, y_true):
        return np.mean((y_pred - y_true) ** 2)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.w_ = np.zeros((X.shape[1], self.classes_.shape[0]))
        for i in range(self.max_iter):
            for k in range(self.classes_.shape[0]):
                # one-versus-all target for class k
                y_binary = np.where(y == self.classes_[k], 1, 0)
                y_pred = self.predict_proba(X)
                errors = y_pred[:, k] - y_binary
                gradient = X.T.dot(errors) / X.shape[0]
                # Hessian plus a ridge term (strength 1/C) that regularizes and keeps it invertible
                hessian = X.T.dot(np.diag(y_pred[:, k] * (1 - y_pred[:, k])).dot(X)) / X.shape[0] + np.eye(X.shape[1]) / self.C
                delta = np.linalg.inv(hessian).dot(gradient)
                self.w_[:, k] -= delta
                if np.sum(delta ** 2) < self.tol:
                    break

    def predict_proba(self, X):
        z = X.dot(self.w_)
        return self.sigmoid(z)

    def predict(self, X):
        y_pred_proba = self.predict_proba(X)
        return self.classes_[np.argmax(y_pred_proba, axis=1)]
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
acc_results = []
# Train the model using Newton's method
for i in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    model = NewtonLogisticRegression(max_iter=100, tol=1e-4, C=1.0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    acc_results.append(accuracy)
# Create a boxplot of the accuracy results
plt.boxplot(acc_results)
plt.title("Accuracy Results")
plt.ylabel("Accuracy")
plt.show()
In the above boxplot, we see the accuracy results for 20 tests of the optimized logistic regression performed on our dataset (using MSE and Newton's method). We see that the highest accuracy score with our given dataset is approximately 65%, and the median is around 63%, which is very similar to the prior multi-class regression.
This should not be surprising, considering that, mathematically, minimizing the mean square error is equivalent to maximizing the likelihood when the noise is assumed to be Gaussian: under that assumption, the negative log-likelihood is, up to constants, a scaled sum of squared errors. So long as the noise and the sample do not change much between tests of the two methods, both should land on similar solutions.
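The equivalence can be stated precisely (see the Bishop reference below, section 1.2.4): if each target is modeled as its prediction plus Gaussian noise, the negative log-likelihood reduces to a scaled sum-of-squares term plus constants that do not depend on the weights:

```latex
p(y_i \mid x_i, w) = \mathcal{N}\!\left(y_i \mid f_w(x_i), \sigma^2\right)
\quad\Longrightarrow\quad
-\ln p(\mathbf{y} \mid \mathbf{X}, w)
= \frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y_i - f_w(x_i)\bigr)^2
+ \frac{N}{2}\ln\bigl(2\pi\sigma^2\bigr)
```

Hence the weights that minimize the sum of squared errors are exactly the maximum-likelihood weights under Gaussian noise.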
However, discrepancies between MSE and MLE can arise when the noise in a dataset is not normally distributed, since the equivalence assumes Gaussian noise. For instance, if a doctor consistently makes the mistake of marking a particular gene marker as present in a patient when it truly isn't, this could skew the variation in the data.
This is important to note, as such non-normal variations in the data could explain why our MSE accuracies are overall skewed slightly lower than the MLE results.
https://www.coursera.org/lecture/state-estimation-localization-self-driving-cars/lesson-3-least-squares-and-the-method-of-maximum-likelihood-XcOb7
https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf (chapter 1.2.4)